EXPLORATORY DATA ANALYSIS – RED WINE QUALITY by Wei Tang

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The dataset consists of 12 variables and 1,599 observations. The quality variable is discrete; the others are continuous.

Red wine quality is approximately normally distributed and concentrated around 5 and 6.

Red wine quality is approximately normally distributed, with most wines in the average range.

The distribution of fixed acidity is right-skewed and concentrated around 8.

Cutting the outliers.
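The report does not say how the outliers were cut. One common approach, assumed here purely for illustration, is to drop values above a high percentile. The sketch below uses hypothetical toy values in place of the real column, and the 95th-percentile cutoff is an assumption, not the author's choice.

```r
# Hypothetical toy values standing in for a right-skewed column
# such as fixed.acidity; the real analysis would use the wine data.
fixed_acidity <- c(7.1, 7.4, 7.8, 7.9, 8.0, 8.3, 8.9, 9.2, 10.1, 15.9)

# Assumed rule: keep only values at or below the 95th percentile.
upper <- quantile(fixed_acidity, probs = 0.95)
trimmed <- fixed_acidity[fixed_acidity <= upper]

length(fixed_acidity)  # number of values before trimming
length(trimmed)        # number of values after trimming
max(trimmed)           # largest value that survives the cut
```

Trimming like this only changes what the plot shows; the dropped wines are still part of the dataset.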

The distribution of citric acid is not normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution of chlorides is right-skewed and concentrated around 0.08. There are a few outliers on this plot.
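A quick numeric check for right skew is to compare the mean with the median: in the summary above, the mean (0.08747) sits above the median (0.07900), which is what a long right tail produces. The sketch below runs the same check, plus a moment-based skewness estimate, on hypothetical toy values rather than the real chlorides column.

```r
# Hypothetical toy values with one large outlier, loosely imitating chlorides.
chlorides <- c(0.012, 0.066, 0.070, 0.075, 0.079,
               0.082, 0.090, 0.095, 0.200, 0.611)

mean_value   <- mean(chlorides)
median_value <- median(chlorides)

# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
centered <- chlorides - mean_value
skewness <- mean(centered^3) / mean(centered^2)^1.5

mean_value > median_value  # TRUE suggests a right tail
skewness                   # positive for right-skewed data
```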

The distribution of chlorides is right-skewed. There are a few outliers on this plot.

The distribution of density is approximately normal and centered around 0.9967.

We divide the data into two groups: the high-quality group contains observations whose quality is 7 or 8, and the low-quality group contains observations whose quality is 3 or 4. After examining the difference in each feature between the two groups, we see that volatile acidity, density, and citric acid may have some correlation with quality. Let’s visualize the data to see the difference.
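The group comparison described above can be sketched as follows. The rows are hypothetical toy data (only the column names come from the summary output), so the numbers illustrate the mechanics rather than the report's actual results.

```r
# Hypothetical toy rows; column names follow the summary output above.
wine <- data.frame(
  quality          = c(3, 4, 4, 5, 6, 7, 7, 8),
  volatile.acidity = c(0.90, 0.75, 0.68, 0.55, 0.50, 0.38, 0.40, 0.35),
  citric.acid      = c(0.05, 0.10, 0.15, 0.25, 0.27, 0.40, 0.38, 0.45),
  density          = c(0.9975, 0.9968, 0.9970, 0.9969,
                       0.9966, 0.9962, 0.9965, 0.9958)
)

# High-quality group: quality 7 or 8. Low-quality group: quality 3 or 4.
high <- subset(wine, quality >= 7)
low  <- subset(wine, quality <= 4)

# Compare the per-feature means of the two groups.
features <- c("volatile.acidity", "citric.acid", "density")
comparison <- data.frame(
  feature   = features,
  high_mean = colMeans(high[, features]),
  low_mean  = colMeans(low[, features])
)
comparison$difference <- comparison$high_mean - comparison$low_mean
print(comparison, row.names = FALSE)
```

Because the features are on very different scales, the raw differences are not directly comparable across rows; dividing each by the feature's standard deviation would be one way to put them on a common footing.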

The lower the volatile acidity, the better the quality.

It seems that density has no correlation with quality.

The higher the citric acid, the better the quality.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset, with 11 features describing the chemical properties of the wine (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, and alcohol) plus a quality rating (quality).

What is/are the main feature(s) of interest in your dataset?

I would like to know which factors determine, or are correlated with, the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Volatile acidity, citric acid, and alcohol seem to contribute to the quality of a wine. However, density seems to have no relationship with the quality of the wine.

I think volatile acidity, citric acid, and alcohol probably contribute most to the quality.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable, quality.level, which divides quality into “low”, “average”, and “high”.
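A variable like quality.level can be built with cut(). The cut points below are an assumption: they follow the earlier grouping (3 or 4 as low, 7 or 8 as high) and treat 5 and 6 as average, which the report does not state explicitly. The quality scores are hypothetical toy values.

```r
# Hypothetical toy quality scores covering the observed range 3 to 8.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed bins: (2,4] -> low, (4,6] -> average, (6,8] -> high.
quality.level <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "average", "high"),
  ordered_result = TRUE
)

table(quality.level)  # count of wines in each level
```

Making the factor ordered keeps the levels in low, average, high order in tables and plots instead of alphabetical order.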

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

This dataset is pretty clean, so I did not need to do any cleaning.

Bivariate Plots Section

The graph shows a very clear trend: the lower the volatile acidity, the higher the quality.

The graph shows there is no positive relationship between quality level and citric acid.

With a correlation coefficient of 0.476, the graph shows a positive relationship between alcohol and quality level.

A weak negative correlation of -0.2 exists between percent alcohol content and volatile acidity.

The correlation coefficient is 0.04, which indicates that there is almost no relationship between residual sugar and percent alcohol content. However, most wines are concentrated in the low-sugar area.

There is a negative correlation between citric acid and volatile acidity.
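Correlation coefficients like those quoted above can be computed with cor() and cor.test(). The sketch below uses hypothetical toy rows, so it will not reproduce 0.476 or -0.2. Pearson is the default method in both functions; the report does not say which method was used, and since quality is an ordinal score, a Spearman coefficient is shown as an alternative.

```r
# Hypothetical toy rows; column names follow the summary output above.
wine <- data.frame(
  alcohol          = c(9.2, 9.5, 9.8, 10.2, 10.8, 11.4, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.62, 0.66, 0.50, 0.55, 0.40, 0.42, 0.30),
  quality          = c(5, 5, 5, 6, 6, 6, 7, 7)
)

# Pearson correlation between alcohol and quality.
r_alcohol_quality <- cor(wine$alcohol, wine$quality)

# Rank-based alternative that suits an ordinal quality score.
rho_alcohol_quality <- cor(wine$alcohol, wine$quality, method = "spearman")

# cor.test() adds a p-value and confidence interval to the estimate.
test <- cor.test(wine$alcohol, wine$volatile.acidity)
test$estimate
test$p.value
```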

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I found a negative relationship between quality level and volatile acidity, and a positive correlation between quality level and alcohol.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The correlation coefficient between residual sugar and alcohol is 0.04, which indicates that there is almost no relationship between residual sugar and percent alcohol content. However, most wines are concentrated in the low-sugar area.

What was the strongest relationship you found?

The strongest relationships were a negative relationship between quality level and volatile acidity and a positive correlation between quality level and alcohol.

Multivariate Plots Section

High-quality wines are concentrated at densities between 0.994 and 0.998 and in the lower range of volatile acidity.

Wines with alcohol between 10 and 13 and volatile acidity between 0.2 and 0.5 tend to be high quality.
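One way to check a region like this numerically is to compare the share of high-quality wines inside and outside it. The sketch below uses hypothetical toy rows and assumes "high quality" means a score of 7 or 8, as in the earlier grouping.

```r
# Hypothetical toy rows; column names follow the summary output above.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.5, 9.6, 11.8),
  volatile.acidity = c(0.70, 0.88, 0.35, 0.30, 0.42, 0.28, 0.60, 0.65),
  quality          = c(5, 4, 7, 7, 6, 8, 5, 6)
)

# Flag wines inside the region: alcohol 10 to 13 and
# volatile acidity 0.2 to 0.5 (both bounds inclusive).
in_region <- with(
  wine,
  alcohol >= 10 & alcohol <= 13 &
    volatile.acidity >= 0.2 & volatile.acidity <= 0.5
)

# Share of high-quality wines (score of 7 or more) in and out of the region.
share_in  <- mean(wine$quality[in_region] >= 7)
share_out <- mean(wine$quality[!in_region] >= 7)

c(inside = share_in, outside = share_out)
```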

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

High-quality wines are concentrated at densities between 0.994 and 0.998 and in the lower range of volatile acidity.

Were there any interesting or surprising interactions between features?

Wines with alcohol between 10 and 13 and volatile acidity between 0.2 and 0.5 tend to be high quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

This plot reveals that the majority of wines are rated 5 or 6. No wine is rated 1, 2, 9, or 10. In other words, there are no extremely bad wines and no extremely good wines.

Plot Two

Description Two

The graph shows a very clear trend: the lower the volatile acidity, the higher the quality.

Plot Three

Description Three

Wines with alcohol between 10 and 13 and volatile acidity between 0.2 and 0.5 tend to be high quality.

Reflection

The wine data set contains 1,599 wines and 12 variables from around 2009. I started by asking some questions, then explored the dataset and created several plots to answer them. First of all, I assumed that certain factors might affect the quality of the wine. For example, pH was negatively correlated to volatile acidity, which makes sense.

I created a linear model to attempt to predict red wine quality. It was accurate for average wines but extremely inaccurate for bad and excellent wines: it overpredicted the bad wines and underpredicted the good ones.
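The model itself is not shown in this report, so the following is only a minimal sketch of how such a pattern can be inspected. The rows are hypothetical toy data and the choice of predictors (alcohol and volatile acidity) is an assumption. Grouping the residuals by quality shows where the fit over- or underpredicts.

```r
# Hypothetical toy rows; column names follow the summary output above.
wine <- data.frame(
  quality          = c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8),
  alcohol          = c(9.8, 9.4, 9.5, 10.0, 9.9, 10.5, 11.0, 10.2, 11.5, 12.0),
  volatile.acidity = c(0.70, 0.80, 0.60, 0.55, 0.62,
                       0.50, 0.45, 0.58, 0.40, 0.42)
)

# Assumed predictors; the report's actual model may differ.
model <- lm(quality ~ alcohol + volatile.acidity, data = wine)

# Residual = actual - predicted. A negative residual means the model
# overpredicted; a positive one means it underpredicted.
wine$residual <- residuals(model)

# Mean residual per quality score.
tapply(wine$residual, wine$quality, mean)

# In least squares with an intercept, residuals are positively
# correlated with the response whenever the fit is not perfect.
cor(wine$residual, wine$quality)
```

That positive correlation means predictions are pulled toward the overall mean, which is consistent with low scores tending to be overpredicted and high scores underpredicted. Treating quality as an ordered category rather than a continuous number would be one alternative to try.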

Alcohol appeared to be the number one factor for determining an excellent wine. However, citric acid and sulphates had to be in specific amounts in order for alcohol to take over. Volatile acidity made a wine bad in large amounts.

The most difficult part for me was spending a lot of time exploring the 12 variables and then deciding which findings were interesting. I also struggled with choosing the most appropriate graph for each plot.